A Quantitative Analysis of Measures of Quality in Science 



Sune Lehmann

Informatics and Mathematical Modeling, Technical University of Denmark, Building 321, DK-2800 Kgs. Lyngby, Denmark. 

Andrew D. Jackson and Benny E. Lautrup 

The Niels Bohr Institute, Blegdamsvej 17, DK-2100 København Ø, Denmark.
(Dated: February 2, 2008) 

Condensing the work of any academic scientist into a one-dimensional measure of scientific quality is a diffi- 
cult problem. Here, we employ Bayesian statistics to analyze several different measures of quality. Specifically, 
we determine each measure's ability to discriminate between scientific authors. Using scaling arguments, we 
demonstrate that the best of these measures requires approximately 50 papers to draw conclusions regarding
long-term scientific performance with usefully small statistical uncertainties. Further, the approach described here
permits the value-free (i.e., statistical) comparison of scientists working in distinct areas of science. 

I. INTRODUCTION

It appears obvious that a fair and reliable quantification 
of the 'level of excellence' of individual scientists is a near-
impossible task [1, 2, 3, 4, 5]. Most scientists would agree on
two qualitative observations: (i) It is better to publish a large 
number of articles than a small number. (ii) For any given paper,
its citation count, relative to citation habits in the field in
which the paper is published, provides a measure of its quality.
It seems reasonable to assume that the quality of a scientist
is a function of his or her full citation record¹. The question
is whether this function can be determined and whether quan- 
titatively reliable rankings of individual scientists can be con- 
structed. A variety of 'best' measures based on citation data 
have been proposed in the literature and adopted in practice 
[6, 7]. The specific merits claimed for these various measures
rely largely on intuitive arguments and value judgments that 
are not amenable to quantitative investigation. (Honest people 
can disagree, for example, on the relative merits of publishing 
a single paper with 1000 citations and publishing 10 papers 
with 100 citations each.) The absence of quantitative support 
for any given measure of quality based on citation data is of 
concern since such data is now routinely considered in mat- 
ters of appointment and promotion which affect every work- 
ing scientist. 

Citation patterns became the target of scientific scrutiny in 
the 1960s as large citation databases became available through 
the work of Eugene Garfield [8] and other pioneers in the 
field of bibliometrics. A surprisingly large body of work on
the statistical analysis of citation data has been performed by 
physicists. Relevant papers in this tradition include the pio- 
neering work of D. J. de Solla Price, e.g. Jgt], and, more re- 
cently, J2I OH [01 [HI]- In addition, physicists are a driving 
force in the emerging field of complex networks. Citation net- 
works represent one popular network specimen in which pa- 
pers correspond to nodes connected by references (out-links) 



and citations (in-links). Citation networks have frequently 
been used as an example of growing networks with preferential
attachment [13]. For reviews on this extensive subject,
see [14, 15, 16]. The aim of the present paper is to take
such studies in a novel direction by addressing the question 
of which one-dimensional measure of citation data is best in 
a manner which is both quantitative and free of value judg- 
ments. Given the remarks above, the ability to answer this 
question depends on a careful definition of the word 'best' . 

The primary purpose of analyzing and comparing the cita- 
tion records of individual scientists is to discriminate between 
them, i.e., to assign some measure of quality and its associ- 
ated uncertainty to each scientist considered. Whatever the 
intrinsic and value-based merits of the measure, m, assigned 
to every author, it will be of no practical value unless the
corresponding uncertainty, $\delta m$, is sufficiently small. From this point
of view, the best choice of measure will be that which provides
maximal discrimination between scientists and hence
the smallest value of $\delta m$. We will demonstrate that the question
of deciding which of several proposed measures is most
discriminating, and therefore 'best', can be addressed quanti- 
tatively using standard statistical methods. 

Although the approach is straightforward, it is useful first 
to describe it in general. We begin by binning all authors by 
some tentative measure, m, of the quality of their full citation 
record. The probability that an author will lie in bin $\alpha$ is
denoted $p(\alpha)$. Similarly, we bin each paper according to the total
number of citations². The full citation record for an author is
simply the set $\{n_i\}$, where $n_i$ is the number of his/her papers
in citation bin $i$. For each author bin, $\alpha$, we then empirically
construct the conditional probability distribution, $P(i|\alpha)$, that
a single paper by an author in this bin will lie in citation bin
$i$. These conditional probabilities are the central ingredient in
our analysis. They can be used to calculate the probability,
$P(\{n_i\}|\alpha)$, that any full citation record was actually drawn at
random from the conditional distribution, $P(i|\alpha)$, appropriate for
1 Citation data is, in fact, publicly available for all academic scientists.



2 We use the Greek alphabet when binning with respect to m and the
Roman alphabet for binning citations.
a fixed author bin, $\alpha$. Bayes' theorem allows us to invert this
probability to yield

$$P(\alpha|\{n_i\}) \sim P(\{n_i\}|\alpha)\, p(\alpha), \tag{1}$$

where $P(\alpha|\{n_i\})$ is the probability that the citation record $\{n_i\}$
was drawn at random from author bin $\alpha$. By considering the
actual citation histories of authors in bin $\beta$, we can thus
construct the probability $P(\alpha|\beta)$ that the citation record of an
author initially assigned to bin $\beta$ was drawn from the distribution
appropriate for bin $\alpha$. In other words, we can determine
the probability that an author assigned to bin $\beta$ on the basis
of the tentative quality measure should actually be placed in
bin $\alpha$. This allows us to determine both the accuracy of the
initial author assignment and its uncertainty in a purely statistical
fashion.

While a good choice of measure will assign each author to
the correct bin with high probability, this will not always be the
case. Consider extreme cases in which we elect to bin authors
on the basis of measures unrelated to scientific quality, e.g.,
by hair/eye color or alphabetically. For such measures $P(i|\alpha)$
and $P(\{n_i\}|\alpha)$ will be independent of $\alpha$, and $P(\alpha|\{n_i\})$ will
become proportional to the prior distribution $p(\alpha)$. As a consequence,
the proposed measure will have no predictive power
whatsoever. It is obvious, for example, that a citation record
provides no information about its author's hair/eye color. The
utility of a given measure (as indicated by the statistical accuracy
with which a value can be assigned to any given author)
will obviously be enhanced when the basic distributions
$P(i|\alpha)$ depend strongly on $\alpha$. These differences can be formalized
using the standard Kullback-Leibler divergence. As
we shall see, there are significant variations in the predictive
power of various familiar measures of quality.
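
The calibration argument can be checked numerically. The following is a minimal Python sketch, not the authors' code: the helper `posterior`, the three author bins, the four citation bins and all numbers are made up for illustration, and only the proportionality of Eq. (1) is used. It suggests that identical conditional distributions simply return the prior, whereas distinct ones concentrate the posterior.

```python
import math

def posterior(prior, cond, counts):
    """P(alpha | {n_i}) proportional to p(alpha) * prod_i P(i|alpha)^n_i (hypothetical helper)."""
    log_w = []
    for p_a, dist in zip(prior, cond):
        # Work in log space so that long citation records do not underflow.
        # Assumes every probability used here is strictly positive.
        log_w.append(math.log(p_a) + sum(n * math.log(q) for n, q in zip(counts, dist)))
    top = max(log_w)
    weights = [math.exp(v - top) for v in log_w]
    total = sum(weights)
    return [w / total for w in weights]

prior = [0.5, 0.3, 0.2]      # made-up p(alpha) for three author bins
record = [12, 6, 3, 1]       # made-up {n_i} over four citation bins

# A measure unrelated to citations: every author bin has the same P(i|alpha).
same = [[0.6, 0.25, 0.1, 0.05]] * 3
uninformative = posterior(prior, same, record)

# A measure related to citations: P(i|alpha) differs from bin to bin.
different = [[0.8, 0.15, 0.04, 0.01],
             [0.6, 0.25, 0.10, 0.05],
             [0.3, 0.30, 0.25, 0.15]]
informative = posterior(prior, different, record)

print(uninformative)   # numerically equal to the prior
print(informative)     # concentrated on one bin
```

In the first case the likelihood is a common factor that cancels in the normalization, which is the sense in which such a measure has no predictive power.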

The organization of the paper is as follows. Section II is
devoted to a description of the data used in the analysis. Section
III introduces the various measures of quality that we will
consider. In Sections IV and V we provide a more detailed
discussion of the Bayesian methods adopted for the analysis
of these measures and a discussion of which of these measures
is best in the sense described above of providing the maximum
discriminatory power. This will allow us in Section VI to address
the question of how many papers are required in order
to make reliable estimates of a given author's scientific quality;
finally, Appendix A discusses the origin of asymmetries in
some of the measures. A discussion of the results and various
conclusions will be presented in Section VII.



II. DATA 

The analysis in this paper is based on data from the 
SPIRES³ database of papers in high energy physics. Our data
FIG. 1: Logarithmically binned histogram of the citation counts of
all papers by authors with more than 25 publications in the theory 
subsection of SPIRES. The data is normalized and the axes are loga- 
rithmic. 



set consists of all citable papers written by academic scientists
from the theory subfield as of the end of 2003. All citations to
papers outside of SPIRES were removed. In the context of 
this paper, we define an academic scientist as someone who 
has published 25 papers or more. This definition is intended 
to include almost everyone with a permanent academic po- 
sition and exclude those who leave academia early in their 
careers (and generally cease active journal publication) in the 
interests of maintaining the homogeneity of the data sample. 
For more details, see [17], Chapters 3 and 4. The resulting data set
includes 6737 authors and a total of 274470 papers. The actual
number of papers is smaller than this since each multiple-author
paper is counted once per co-author. The theory subfield
is, however, that part of high energy physics where this
effect is least pronounced. This is due to the relatively small
number of co-authors (typically 1 to 3) per theoretical paper.
In the case of the theory subfield, this weighting of papers by
the number of co-authors has been shown to have negligible
effects [11].

The theory subsection of the SPIRES data has a power-law
structure. Specifically, the probability that a paper will receive
$k$ citations is approximately proportional to $(k+1)^{-\gamma}$
with $\gamma = 1.11$ for $k < 50$ and $\gamma = 2.78$ for $k > 50$. The
transition between these two power laws is found to be surprisingly
sharp [11]. These features of the global distribution
are also present in the conditional probabilities for subgroups
of authors binned according to most measures of quality.
In virtually all cases, these conditional probabilities can
also be described accurately by separate power laws in each
of two regions with a relatively sharp transition between the
regions. As one might expect, authors with more citations
are described by flatter distributions (i.e., smaller values of
$\gamma$) and a somewhat higher transition point. Figure 1 displays
the total distribution of citations as a binned and normalized 



3 SPIRES is an acronym for Stanford Physics Information REtrieval
System. The database is open and can be found at
http://www.slac.stanford.edu/spires/. Citations in SPIRES are gathered
only from the papers in the database that have references entered
electronically via eprints or journal articles. Publications such as
monographs or conference proceedings are treated inconsistently and are
therefore not included in this study.
FIG. 2: Logarithmically binned histogram of the citations in bin 6
of the median measure. The △ points show the citation distribution
of the first 25 papers by all authors. The points marked by ★ show
the distribution of citations from the first 50 papers by authors who 
have written more than 50 papers. Finally, the □ data points show 
the distribution of all papers by all authors. The axes are logarithmic. 



histogram⁴.
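
To make the two-regime power law concrete, the following Python sketch builds a normalized distribution with the two exponents quoted above and compares its mean and median. It is an illustration under stated assumptions, not a fit to SPIRES: the form $(k+1)^{-\gamma}$, the continuity of the two branches at $k = 50$, the cutoff `k_max = 5000` and all function names are choices made here.

```python
def two_regime_pmf(k_max=5000, k_break=50, g_low=1.11, g_high=2.78):
    """Normalized p(k) ~ (k+1)^-gamma with a different exponent on each side of k_break."""
    # Prefactor that makes the two branches agree at k_break (a modelling assumption).
    join = (k_break + 1) ** (g_high - g_low)
    weights = []
    for k in range(k_max + 1):
        if k < k_break:
            weights.append((k + 1) ** -g_low)
        else:
            weights.append(join * (k + 1) ** -g_high)
    norm = sum(weights)
    return [w / norm for w in weights]

def pmf_mean(pmf):
    """Mean number of citations under the distribution."""
    return sum(k * p for k, p in enumerate(pmf))

def pmf_median(pmf):
    """Smallest k at which the cumulative probability reaches one half."""
    cumulative = 0.0
    for k, p in enumerate(pmf):
        cumulative += p
        if cumulative >= 0.5:
            return k
    return len(pmf) - 1

pmf = two_regime_pmf()
mean_k = pmf_mean(pmf)
median_k = pmf_median(pmf)
share_above_mean = sum(p for k, p in enumerate(pmf) if k > mean_k)

print(f"mean = {mean_k:.1f}, median = {median_k}, share above mean = {share_above_mean:.2f}")
```

Under these assumptions the mean lies well above the median and only a minority of papers exceed the mean, which is the qualitative behaviour expected of a heavy-tailed citation distribution.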

Studies performed on the first 25, first 50 and all papers for 
a given value of m show the absence of temporal correlations. 
It is of interest to see this explicitly. Consider the following 
example. In Figure 2 we have plotted the distribution for bin
6 of the median measure⁵. There are 674 authors in this bin.
Two thirds of these authors have written 50 papers or more.
Only this subset is used when calculating the first 50 papers
results. In this bin, the means for the total, first 25 and first
50 papers are 11.3, 12.8, and 12.9 citations per paper, respectively.
The medians of the distributions are 4, 6, and 6. The
plot in Figure 2 confirms these observations. The remaining
bins and the other measures yield similar results. 

Note that Figure 2 confirms the general observations on the
shapes of the conditional distributions made above. Figure 2
also shows two distinct power laws. Both of the power laws in
this bin are flatter than the ones found in the total distribution
and the transition point is lower than in the total distribution
from Figure 1.



III. MEASURES OF SCIENTIFIC EXCELLENCE

Despite differing citation habits in different fields of sci- 
ence, most scientists agree that the number of citations of a 
given paper is the best objective measure of the quality of that 
paper. The belief underlying the use of citations as a measure 
of quality is that the number of citations to a paper provides 



an indication of how often the content of that paper has been 
used in the work of others⁶. Note, however, the obvious fact
that citations can only be interpreted as a meaningful proxy
of quality relative to the citation habits of one's peers or, put
slightly differently, in the context of the citation habits of the
field in which the paper is published. In [11], we have shown
that the theory subsection of SPIRES is indeed a very homo- 
geneous data set. In this sense, we will assume that the cita- 
tion count of a paper is a proxy of the intrinsic quality of that 
paper. 

The questions remain, however, of how to extract a measure 
of the quality of an individual scientist from his citation record 
and how fairly to project this record onto a scalar measure. 
This question is non-trivial because the probability, p(k) of 
finding a scientific paper with k citations roughly follows an 
asymptotic power-law distribution, see Figs. 1 and 2. This fact
was documented for the SPIRES data in Ref. [11] and holds
true in many other scientific fields J^, [T(| Il6ll . Thus, it is 
useful to consider some of the properties of the distribution of 
citations for all authors before discussing the various specific 
measures of quality to be considered here. 

Empirical evidence indicates that most citation distributions 
are largely power-law distributed with $p(k) \sim k^{-\gamma}$. For small
values of $k$, $\gamma \approx 1$; for larger values, $2 < \gamma < 3$. Although
the average number of citations per paper is well-defined, the
asymptotic power-law tails of these distributions cause their
variance to be infinite⁷. When the variance is not defined (or
very large), the mean values of a finite sample fluctuate significantly
as a function of sample size. As a consequence,
the average number of citations, $\langle k \rangle$, in the citation record
of a given author (which is precisely a finite sample drawn 
from a power-law probability distribution) is a potentially un- 
reliable measure of the quality of an author's citation record 
since the addition or removal of a single highly cited paper can 
materially alter an author's mean. Nevertheless, the mean of 
an author's citations is commonly used as an intensive scalar 
measure of author quality. 

The reservations just expressed about the use of mean cita- 
tions per paper apply with even greater force if one chooses 
to measure author quality by the number of citations of each 
author's single most highly cited paper, $k_{\max}$. Virtually all
of the stabilizing statistical power of the full citation record 
has been discarded, and even greater fluctuations can be ex- 
pected in this measure as the sample size changes. In spite of 
such statistical arguments, there are reasons for considering 
the maximum cited paper as a measure of quality. It is per- 
fectly tenable to claim that the author of a single paper with 



4 Due to matters of visual presentation, the binning used in this and the
following figure is different from the binning used when constructing
the $P(i|\alpha)$ used later in the paper. The correct binning is described in
Appendix B.

5 Since this plot is constructed from authors assigned to bin 6, each paper is
weighted by the number of its authors present in this bin. Weighting papers
by the number of co-authors, however, does not significantly change the
distribution of citations [11].



6 We realize that there are a number of problems related to the use of cita- 
tions as a proxy for quality. Papers may be cited or not for reasons other 
than their high quality. Geo- and/or socio-political circumstances can keep 
works of high quality out of the mainstream. Credit for an important idea 
can be attributed incorrectly. Papers can be cited for historical rather than 
scientific reasons. Indeed, the very question of whether authors actually 
read the papers they cite is not a simple one [18]. Nevertheless, we assume 
that correct citation usage dominates the statistics. 

7 Diverging higher moments of power-law distributions are discussed in the
literature; see, e.g., [19].






1000 citations is of greater value to science than the author of
10 papers with 100 citations each (even though the latter is far 
less probable than the former). In this sense, the maximally 
cited paper might provide better discrimination between au- 
thors of 'high' and 'highest' quality, and this measure merits 
consideration. 

Another simple and widely used measure of scientific ex- 
cellence is the average number of papers published by an au- 
thor per year. This would be a good measure if all papers 
were cited equally. As we have just indicated, scientific pa- 
pers are emphatically not cited equally, and few scientists hold 
the view that all published papers are created equal in quality 
and importance. Indeed, roughly 50% of all papers in SPIRES 
are cited < 2 times (including self-citation). This fact alone is 
sufficient to invalidate publication rate as a measure of sci- 
entific excellence. If all papers were of equal merit, citation 
analysis would provide a measure of industry rather than one 
of intrinsic quality. 

In an attempt to remedy this problem, Thomson Scientific
(ISI) introduced the Impact Factor⁸, which is designed
to be a "measure of the frequency with which the 'average
article' in a journal has been cited in a particular year or period"⁹.
The Impact Factor can be used to weight individual
papers. Unfortunately, citations to articles in a given journal
also obey power-law distributions [12]. This has two consequences.
First, the determination of the Impact Factor is subject
to the large fluctuations which are characteristic of power-law
distributions. Second, the tail of power-law distributions
displaces the mean citation to higher values of $k$ so that the
majority of papers have citation counts that are much smaller
than the mean. This fact is expressed, for example, in the large
difference between mean and median citations per paper. For
the total SPIRES database, the median is 2 citations per paper;
the mean is approximately 15. Indeed, only 22% of the
papers in SPIRES have a number of citations in excess of the
mean, cf. [11]. Thus, the dominant role played by a relatively
small number of highly cited papers in determining the Impact
Factor implies that it is subject to relatively large fluctuations
and that it tends to overestimate the level of scientific excellence
of high impact journals. This fact was directly verified by
Seglen [20], who showed explicitly that the citation rate for
individual papers is uncorrelated with the impact factor of the
journal in which they were published.

An alternate way to measure excellence is to categorize
each author by the median number of citations of his or her papers,
$k_{1/2}$. Clearly, the median is far less sensitive to statistical fluctuations
since all papers play an equal role in determining its
value. To demonstrate the robustness of the median, it is useful
to note that the median of $\mathcal{N} = 2N + 1$ random draws on
any normalized probability distribution, $q(x)$, is normally distributed
in the limit $\mathcal{N} \to \infty$. To this end we define the integral



8 For a full definition see http://scientific.thomson.com/
9 Ibid. 



of $q(x)$ as

$$Q(x) = \int_{-\infty}^{x} q(x')\,dx'. \tag{2}$$

Evidently, $Q(x)$ grows monotonically from 0 to 1 independent
of $q(x)$. The 'median' of this sample is defined as that value
of $x$ such that (i) one draw has the value $x$, (ii) $N$ draws have
a value less than or equal to $x$, and (iii) $N$ draws have a value
greater than or equal to $x$. The probability that the median is
at $x$ is now given as

$$P_{x_{1/2}}(x) = \frac{(2N+1)!}{N!\,N!}\, q(x)\, Q(x)^N [1 - Q(x)]^N. \tag{3}$$

For large $N$, the maximum of $P_{x_{1/2}}(x)$ occurs at $x = x_{1/2}$ where
$Q(x_{1/2}) = 1/2$. Expanding $P_{x_{1/2}}(x)$ about its maximum value,
we see that

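$$P_{x_{1/2}}(x) \approx \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x - x_{1/2})^2}{2\sigma^2}\right], \qquad \sigma^2 = \frac{1}{8N\,q(x_{1/2})^2}. \tag{4}$$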

A similar argument applies for every percentile. The statis- 
tical stability of percentiles suggests that they are well-suited 
for dealing with the power laws which characterize citation 
distributions. 
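
The stability of the median can be illustrated with a small simulation. The sketch below is an illustration only: it uses an exponential distribution (chosen here because its median, $\ln 2$, and its density at the median, $1/2$, are known exactly), a fixed random seed, and the standard large-$N$ result that the median of $2N+1$ draws has variance of roughly $1/[8N\,q(x_{1/2})^2]$, which follows from expanding Eq. (3) about its maximum. The function name `simulate_medians` is hypothetical.

```python
import math
import random
import statistics

def simulate_medians(n_half, trials, seed=0):
    """Medians of 2*n_half + 1 draws from an Exp(1) distribution, repeated `trials` times."""
    rng = random.Random(seed)
    size = 2 * n_half + 1
    return [statistics.median([rng.expovariate(1.0) for _ in range(size)])
            for _ in range(trials)]

N = 50
medians = simulate_medians(N, trials=4000)
observed_mean = statistics.mean(medians)
observed_var = statistics.pvariance(medians)

true_median = math.log(2.0)     # Q(x) = 1 - exp(-x) equals 1/2 at x = ln 2
density_at_median = 0.5         # q(ln 2) = exp(-ln 2)
predicted_var = 1.0 / (8 * N * density_at_median ** 2)

print(f"mean of medians {observed_mean:.3f} (true median {true_median:.3f})")
print(f"variance of medians {observed_var:.4f} (large-N estimate {predicted_var:.4f})")
```

The same experiment can be repeated for other percentiles by replacing the median with the appropriate order statistic.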

Recently, Hirsch [7] proposed a different measure, $h$, intended
to quantify scientific excellence. Hirsch's definition is
as follows: "A scientist has index $h$ if $h$ of his/her $N_p$ papers
have at least $h$ citations each, and the other $(N_p - h)$ papers
have fewer than $h$ citations each" [7]. Unlike the mean and the
median, which are intensive measures largely constant in time,
$h$ is an extensive measure which grows throughout a scientific
career. Hirsch assumes that $h$ grows approximately linearly
with an author's professional age, defined as the time between
the publication dates of the first and last paper. Unfortunately,
this does not lead to an intensive measure. Consider, for example,
the case of authors with large time gaps between publications,
or the case of authors whose citation data are recorded
in disjoint databases. A properly intensive measure can be
obtained by dividing an author's $h$-index by the number of
his/her total publications. We will consider both approaches
below.
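
For concreteness, the index and its per-paper normalization can be computed as in the following Python sketch. The function names are hypothetical, and the code uses the common operational reading of the definition (the largest $h$ such that $h$ papers have at least $h$ citations each); the example records are the ones contrasted in the Introduction.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break   # the list is sorted, so no later paper can qualify
    return h

def h_per_paper(citations):
    """h divided by the total number of papers (the intensive variant)."""
    return h_index(citations) / len(citations) if citations else 0.0

one_big_paper = [1000]        # a single paper with 1000 citations
ten_solid_papers = [100] * 10 # ten papers with 100 citations each

print(h_index(one_big_paper), h_index(ten_solid_papers))
print(h_per_paper(one_big_paper), h_per_paper(ten_solid_papers))
```

The two records have the same total number of citations but very different values of $h$, while the per-paper normalization assigns them the same value, which illustrates how strongly the ranking depends on the choice of measure.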

The $h$-index represents an attempt to strike a balance between
productivity and quality and to escape the tyranny of
power law distributions which place strong weight on a relatively
small number of highly cited papers. The problem
is that Hirsch assumes an equality between incommensurable
quantities. An author's papers are listed in order of decreasing
citations with paper $i$ having $C(i)$ citations. Hirsch's measure
is determined by the equality, $h = C(h)$, which posits an equality
between two quantities with no evident logical connection.
While it might be reasonable to assume that $h^{\gamma} \sim C(h)$, there is
no reason to assume that $\gamma$ and the constant of proportionality
are both 1.

We will also include one intentionally nonsensical choice 
in the following analysis of the various proposed measures of 
author quality. Specifically, we will consider what happens 
when authors are binned alphabetically. In the absence of his- 
torical information, it is clear that an author's citation record 
should provide us with no information regarding the author's
name. Binning authors in alphabetic order should thus fail any 
statistical test of utility and will provide a useful calibration 
of the methods adopted. The measures of quality described in 
this section are the ones we will consider in the remainder of 
this paper. 

IV. A BAYESIAN ANALYSIS OF CITATION DATA 

The rationale behind all citation analyses lies in the fact 
that citation data is strongly correlated such that a 'good' 
scientist has a far higher probability of writing a good (i.e., 
highly cited) paper than a 'poor' scientist. Such correlations 
are clearly present in SPIRES [11, 21]. We thus categorize
each author by some tentative quality index based on their total
citation record. Once assigned, we can empirically construct
the prior distribution, $p(\alpha)$, that an author is in author
bin $\alpha$ and the probability $P(N|\alpha)$ that an author in bin $\alpha$ has
a total of $N$ publications. We also construct the conditional
probability $P(i|\alpha)$ that a paper written by an author in bin $\alpha$
will lie in citation bin $i$. As we have seen earlier, studies performed
on the first 25, first 50 and all papers of authors in a
given bin reveal no signs of additional temporal correlations 
in the lifetime citation distributions of individual authors. In 
performing this construction, we have elected to bin authors in 
deciles. We bin papers into L bins according to the number of 
citations. The binning of papers is approximately logarithmic 
(see Appendix A). We have confirmed that the results stated 
below are largely independent of the bin-sizes chosen. 

We now wish to calculate the probability, $P(\{n_i\}|\alpha)$, that
an author in bin $\alpha$ will have the full (binned) citation record
$\{n_i\}$. In order to perform this calculation, we assume that the
various counts $n_i$ are obtained from $N$ independent random
draws on the appropriate distribution, $P(i|\alpha)$. Thus,

$$P(\{n_i\}|\alpha) = P(N|\alpha)\, N! \prod_{i=1}^{L} \frac{P(i|\alpha)^{n_i}}{n_i!}. \tag{5}$$

Although large scale temporal correlations are known to be 
absent, transient correlations are possible. For example, one 
particularly well-cited paper could lead to an increased prob- 
ability of high citations for its immediate successor(s). It is 
difficult to demonstrate their presence or absence, but the results
of the following section will provide a posteriori evidence
that such correlations, if present, are not important. 

We can now invert the probability $P(\{n_i\}|\alpha)$ using Bayes'
theorem to obtain

$$P(\alpha|\{n_i\}) = \frac{P(\{n_i\}|\alpha)\, p(\alpha)}{P(\{n_i\})} = \frac{p(\alpha)\, P(N|\alpha) \prod_j P(j|\alpha)^{n_j}}{\sum_{\alpha'} p(\alpha')\, P(N|\alpha') \prod_j P(j|\alpha')^{n_j}}, \tag{6}$$

where we have inserted Eq. (5) and used marginalization to
obtain the normalization. The combinatoric factors cancel.
The quantity $P(\alpha|\{n_i\})$ represents the probability that
an author with binned citation record $\{n_i\}$ is in author bin $\alpha$.
It can be used in two ways, each of which is interesting.
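
The inversion can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names, the two author bins, the three citation bins and all probabilities are made up. The multinomial likelihood with the factor $P(N|\alpha)$ is evaluated in log space and then normalized over author bins.

```python
import math

def log_likelihood(counts, p_n, cond):
    """log P({n_i}|alpha) = log[ P(N|alpha) N! prod_i P(i|alpha)^n_i / n_i! ]."""
    if p_n == 0.0:
        return -math.inf
    n_total = sum(counts)
    value = math.log(p_n) + math.lgamma(n_total + 1)   # log P(N|alpha) + log N!
    for n, q in zip(counts, cond):
        if n == 0:
            continue                                   # empty bins contribute a factor of 1
        if q == 0.0:
            return -math.inf                           # record is impossible for this author bin
        value += n * math.log(q) - math.lgamma(n + 1)
    return value

def posterior(counts, prior, p_n_by_bin, cond_by_bin):
    """P(alpha|{n_i}) by Bayes' theorem; assumes every prior entry is positive."""
    logs = [math.log(p) + log_likelihood(counts, p_n, cond)
            for p, p_n, cond in zip(prior, p_n_by_bin, cond_by_bin)]
    top = max(logs)
    if top == -math.inf:
        raise ValueError("record has zero probability in every author bin")
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

# Toy inputs (all values hypothetical).
prior = [0.5, 0.5]
p_n = [0.1, 0.1]                       # P(N|alpha) for the record length in question
cond = [[0.7, 0.3, 0.0],               # bin 0 never produces papers in citation bin 2
        [0.5, 0.3, 0.2]]

post_a = posterior([3, 1, 0], prior, p_n, cond)
post_b = posterior([3, 1, 1], prior, p_n, cond)   # one paper in citation bin 2
print(post_a, post_b)
```

The combinatorial factors are the same for every author bin and drop out of the normalized result; they are kept here only so that `log_likelihood` matches the form of the likelihood term by term. The second record shows how a single paper in a citation bin that an author bin never populates forces the posterior for that author bin to zero.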



For any measure chosen, Eq. (6) provides us with the probability
that an author lies in author bin $\alpha$. While the value of
any measure (such as the mean number of citations per paper)
can be calculated directly, the calculated values of $P(\alpha|\{n_i\})$
provide far more detailed and more reliable information using
all statistical information contained in the data. The large fluctuations
which can be encountered in identifying authors by
their mean citation rate or by their maximally cited paper are
reduced. Further, by providing us with values of $P(\alpha|\{n_i\})$ for
all $\alpha$, we obtain a statistically trustworthy gauge of whether
the resulting uncertainties in $\alpha$ are sufficiently small for the
measure under consideration to be a reliable indicator of author
quality. In short, Eq. (6) provides us with a measure of
an author's ranking independent of the total number of papers
currently published, and with information which allows us to
assess the reliability of this determination. The accuracy of
the resulting value of $\alpha$ increases dramatically with the total
number of published papers. We will return to this point in
Section V.

Fig. 3 shows the probabilities $P(\alpha|\{n_i\})$ that author A will lie in
each of the decile bins using the measures discussed in Section
III. These measures include: (a) the first initial of the author's
name, (b) the average yearly output of papers, (c) Hirsch's $h$
normalized by the author's professional age $T$, (d) the $h$-index
normalized by the number of published papers, (e) the citation
count of the single most cited paper, (f) the mean number of
citations per paper, (g) the median number (50th percentile)
of citations per paper, and (h) a 65th percentile measure. It is
clear from the figure that there are significant differences, both
in the accuracy of the initial assignments and, more importantly,
in the corresponding uncertainties. Large uncertainties
are due to the fact that the conditional probabilities, $P(i|\alpha)$,
are largely independent of $\alpha$. Such independence is to be expected
in the case of the alphabetic binning of authors, where
the inability of the citation record to identify the first initial of
author A's name is hardly surprising. The figure also suggests
that the number of papers published per year is not reliable.
Initial assignments of author A based on mean, median, 65th
percentile, and maximum citations all appear to provide an
accurate reflection of his full citation record with a satisfactorily
small uncertainty. Hirsch's measures fall somewhere
between the best and worst choice of measures.

Given the large variations in the accuracy and confidence 
of decile assignments as a function of the measure selected, 
it is of interest to investigate in greater detail the question of 
which of these measures is best. We address this question in 
the next section. 



V. WEIGHING THE MEASURES 

FIG. 3: A single author example. We analyze the citation record of author A with respect to the eight different measures defined in the text.
Author A has written a total of 88 papers. The mean of this citation record is 26 citations per paper, the median is 13 citations, the $h$-index is
29, the maximally cited paper has 187 citations, and papers have been published at the average rate of 2.5 papers per year. The various panels
show the probability that author A belongs to each of the ten deciles given the corresponding measure; the vertical arrow displays the initial
assignment. Panel (a) displays $P(\text{first initial}|A)$, (b) shows $P(\text{papers per year}|A)$, (c) shows $P(h/T|A)$, (d) shows $P(h/N|A)$, panel (e) shows
$P(k_{\max}|A)$, panel (f) displays $P(\langle k \rangle|A)$, (g) shows $P(k_{1/2}|A)$, and finally (h) shows $P(k_{65}|A)$.

In order to obtain a more graphic representation of the quality
of a given measure, we calculate the probability, $P(\beta|\alpha)$,
that an author initially assigned to bin $\alpha$ is predicted to lie
in bin $\beta$. In practice, we determine $P(\beta|\alpha)$ as the average of
the probability distributions $P(\beta|\{n_i\})$ for each author in bin
$\alpha$. The results are shown 'stacked' in Fig. 4 for the various
measures considered. Here, row $\alpha$ shows the (average) probabilities
that an author initially assigned to bin $\alpha$ belongs in
decile bin $\beta$. This probability is proportional to the area of the
corresponding squares. Obviously, a perfect measure would
place all of the weight in the diagonal entries of these plots.
Weights should be centered about the diagonal for an accurate
identification of author quality and the certainty of this identification
grows as weight accumulates in the diagonal boxes.
Note that an assignment of a decile based on Eq. (6) is likely
to be more reliable than the value of the initial assignment
since the former is based on all information contained in the
citation record.
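
The averaging that defines $P(\beta|\alpha)$ can be sketched as follows. This Python fragment is an illustration with made-up data: two author bins, three citation bins, four hypothetical authors, and the simplifying assumption (made here, not in the text) that $P(N|\alpha)$ is the same for every bin and therefore cancels. The names `posterior` and `assignment_matrix` are hypothetical.

```python
import math

def posterior(counts, prior, cond):
    """P(beta|{n_i}) proportional to p(beta) * prod_i P(i|beta)^n_i (positive probabilities assumed)."""
    logs = []
    for p_b, dist in zip(prior, cond):
        logs.append(math.log(p_b) +
                    sum(n * math.log(q) for n, q in zip(counts, dist) if n))
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

def assignment_matrix(authors, prior, cond):
    """Row alpha holds the posterior over beta averaged over authors initially placed in bin alpha."""
    n_bins = len(prior)
    sums = [[0.0] * n_bins for _ in range(n_bins)]
    sizes = [0] * n_bins
    for alpha, counts in authors:
        post = posterior(counts, prior, cond)
        sizes[alpha] += 1
        for beta in range(n_bins):
            sums[alpha][beta] += post[beta]
    # Rows for author bins without any authors are left as zeros.
    return [[s / sizes[a] if sizes[a] else 0.0 for s in sums[a]]
            for a in range(n_bins)]

prior = [0.5, 0.5]
cond = [[0.7, 0.2, 0.1],
        [0.3, 0.3, 0.4]]
# (initial bin, binned citation record) for four hypothetical authors
authors = [(0, [8, 2, 0]), (0, [5, 3, 2]), (1, [2, 3, 5]), (1, [4, 3, 3])]

matrix = assignment_matrix(authors, prior, cond)
for row in matrix:
    print([round(x, 3) for x in row])
```

Each populated row sums to one, and the weight on the diagonal indicates how often the statistical assignment agrees with the initial one.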

Figure 4 emphasizes that 'first initial' and 'publications per 
year' are not reliable measures. The h-index normalized by 
professional age performs poorly; when normalized by number 
of papers, the trend towards the diagonal is enhanced. We 
note the appearance of vertical bars in each figure in the top 
row. This feature is explained in Appendix A. All four measures 
in the bottom row perform fairly well. The initial assignment 
of the k_max measure always underestimates an author's 
correct bin. This is not an accident and merits comment. 
Specifically, if an author has produced a single paper with 
citations in excess of the values contained in bin α, the 
probability that he will lie in this bin, as calculated with Eq. (6), 
is strictly 0. Non-zero probabilities can be obtained only for 
bins including maximum citations greater than or equal to the 
maximum value already obtained by this author. (The fact that 
the probabilities for these bins shown in Fig. 4 are not strictly 
0 is a consequence of the use of finite bin sizes.) Thus, binning 
authors on the basis of their maximally cited paper necessarily 
underestimates their quality. The mean, median and 65th 
percentile appear to be the most balanced measures with roughly 
equal predictive value. 
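
The zero-probability effect described above can be illustrated with a minimal sketch. It assumes that the posterior over author bins has the form P(α|{n_i}) ∝ P(α) ∏_i P(i|α)^{n_i} (the same form that underlies Eq. (8); any factor depending on the total number of papers is omitted here), and it uses invented toy distributions. The function name `decile_posterior` and all numbers are hypothetical.

```python
import math

def decile_posterior(counts, cond, prior=None):
    """Posterior over author bins for a single citation record.

    counts[i]  : number of the author's papers in citation bin i
    cond[a][i] : assumed P(i|alpha) for author bin a
    prior[a]   : assumed P(alpha); uniform if omitted
    """
    n_bins = len(cond)
    if prior is None:
        prior = [1.0 / n_bins] * n_bins
    log_post = []
    for a in range(n_bins):
        lp = math.log(prior[a])
        for n_i, p_i in zip(counts, cond[a]):
            if n_i == 0:
                continue          # empty citation bins do not contribute
            if p_i == 0.0:
                lp = -math.inf    # one paper in an impossible bin excludes bin a
                break
            lp += n_i * math.log(p_i)
        log_post.append(lp)
    top = max(log_post)
    if top == -math.inf:
        raise ValueError("record is impossible under every author bin")
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]

# Toy example (invented numbers): two author bins, three citation bins.
cond = [
    [0.7, 0.3, 0.0],  # lower bin: never produces a paper in the top citation bin
    [0.4, 0.4, 0.2],  # higher bin
]
with_top_paper = decile_posterior([5, 3, 1], cond)     # lower bin is excluded
without_top_paper = decile_posterior([5, 3, 0], cond)  # lower bin is favoured
```

A single paper in the top citation bin is enough to drive the posterior weight of the lower bin to exactly zero, which is the mechanism behind the systematic underestimate noted for k_max.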

It is clear from Eq. (6) that the ability of a given measure to 
discriminate is greatest when the differences between the 
conditional probability distributions, P(i|α), for different author 
bins are largest. These differences can be quantified by measuring 
the 'distance' between two such conditional distributions 
with the aid of the Kullback-Leibler (KL) divergence (also 
known as the relative entropy). The KL divergence between 
two discrete probability distributions, p and p', is defined 
(see footnote 10) as 

\mathrm{KL}[p, p'] = \sum_i p_i \ln\left(\frac{p_i}{p'_i}\right) . (7) 

The Kullback-Leibler divergence is non-negative and has desirable 
convexity properties. It is, however, not a metric due to the 
fact that KL[p',p] ≠ KL[p,p']. While this asymmetry is of little 
concern when the differences between p and p' are small, 
some care is required when such differences are large. This 
can occur when the data set is so small that some citation 
bins are empty or when we bin authors by k_max, in which case 
empty bins are inevitable as noted above. We consider the KL 
distance between adjacent distributions. Fig. 5 shows the 
distances KL[P(i|α),P(i|α+1)] for various measures. The 
probability P(β = α+1|α) is exponentially sensitive to the KL 
divergence. Measures with large KL divergences between 
adjacent bins provide the most certain assignments of authors. 
The KL divergences for the measures not shown are significantly 
smaller than those displayed. The results of Fig. 5 provide 
quantitative support for the roughly equal performance 
of the mean, median, and 65th percentile measures (see footnote 11) 
seen in Figure 4. The h-index normalized by number of publications is 



10 The non-standard choice of the natural logarithm rather than the logarithm 
base two in the definition of the KL divergence will be justified below. 
11 Figure 5 gives a misleading picture of the k_max measure, since the KL 
divergences KL[P(i|α+1),P(i|α)] are infinite as discussed above. 
FIG. 4: Eight different measures. Each horizontal row shows the average probabilities (proportional to the areas of the squares) that authors 
initially assigned to decile bin α are predicted to belong in bin β. Panels as in Fig. 3. 
FIG. 5: The Kullback-Leibler divergences KL[P(i|α),P(i|α+1)]. 
Results are shown for the following distributions: h-index normalized 
by number of publications, maximum number of citations, 
mean, median, and 65th percentile. 




FIG. 6: Binning according to deciles. This plot displays a normal 
distribution (solid black line) as an example of a probability distri- 
bution peaked around a non-zero maximum. The grey vertical lines 
mark the boundaries of the 10 deciles. 



dramatically smaller than the other measures shown except for 
the extreme deciles. 
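
The properties of Eq. (7) discussed above (natural logarithm, asymmetry, and divergence when one distribution has an empty bin) can be sketched as follows. The function name `kl_divergence` and the two toy distributions are illustrative assumptions and are not taken from the data.

```python
import math

def kl_divergence(p, q):
    """KL[p, q] = sum_i p_i ln(p_i / q_i), using the natural logarithm."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue         # a term with p_i = 0 contributes nothing
        if q_i == 0.0:
            return math.inf  # p has weight in a bin where q has none
        total += p_i * math.log(p_i / q_i)
    return total

# Toy distributions (invented): the lower author bin has an empty top citation bin.
p_low = [0.5, 0.5, 0.0]
p_high = [0.25, 0.5, 0.25]

forward = kl_divergence(p_low, p_high)   # finite
backward = kl_divergence(p_high, p_low)  # infinite because of the empty bin
```

The two orderings give very different answers, which is the reason for the caution about the k_max measure expressed in footnote 11.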

The reduced ability of all measures to discriminate in the 
middle deciles is immediately apparent from Fig. 5. This is a 
direct consequence of any percentile binning: given that the 
distribution of author quality has a maximum at some non-zero 
value, the bin size of a percentile distribution near the 
maximum will necessarily be small. The accuracy with which 
authors can be assigned to a given bin in the region around the 
maximum is reduced since one is attempting to distinguish 
between authors with very similar citation distributions. As 
a result, the statistical accuracy of percentile assignments is 
high at the extremes and relatively low in the middle of the 
distribution where we are attempting to make fine distinctions 
between scientists of similar ability. This effect is illustrated 
in Fig. 6. 
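
The narrowing of percentile bins near the maximum can be checked with a short sketch. It assumes, purely for illustration, a normal distribution of author quality (as in the example of Fig. 6) with arbitrary parameters; the outermost bins are truncated at the 0.1% and 99.9% points because the true outer edges are infinite.

```python
from statistics import NormalDist

def decile_edges(dist, lo=0.001, hi=0.999):
    """Quantile values bounding the ten deciles, with truncated outer edges."""
    quantiles = [lo] + [k / 10 for k in range(1, 10)] + [hi]
    return [dist.inv_cdf(q) for q in quantiles]

# Arbitrary illustrative parameters: peak at 10, standard deviation 3.
edges = decile_edges(NormalDist(mu=10.0, sigma=3.0))
widths = [right - left for left, right in zip(edges, edges[1:])]

for decile, width in enumerate(widths, start=1):
    print(f"decile {decile:2d}: width {width:.2f}")
```

The bins adjacent to the peak are several times narrower than the outermost bins, so authors in the middle deciles are separated by much smaller differences in the underlying measure.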
FIG. 7: The probability that a typical (i.e., most probable) author 
with 50 published papers will be assigned to the correct decile as a 
function of actual author decile. The median number of citations is 
used as a measure. 



VI. SCALING 



In this section, we consider the question of how many 
published papers are required in order to make a reliable 
prediction of the percentile ranking of a given author. (We consider 
results only using the 65th percentile measure.) If this number 
is sufficiently small, analysis along the lines presented 
here can provide a practical tool of potential value in 
predicting long-term scientific performance. In order to address this 
question, we will consider how P(α|{n_i}) scales as a function 
of the total number of publications for an average author in 
each bin. Assume that an average author belonging to bin α 
draws N papers at random from the distribution P(i|α). The 
most probable number of papers in each citation bin will thus 
be given as n_i = N P(i|α). Inserting this result into Eq. (6) and 
discarding all fixed factors, we find that 

P(\alpha|\{n_i\}) \sim P(\alpha) \left( \prod_i P(i|\alpha)^{P(i|\alpha)} \right)^N . (8) 

For the same citation record, {n_i}, a similar expression 
permits determination of the probability that this average author 
will be assigned to any bin, β. We see that 

\frac{P(\beta|\{n_i\})}{P(\alpha|\{n_i\})} = \frac{P(\beta)}{P(\alpha)} \exp\left(-N\,\mathrm{KL}[P(\cdot|\alpha), P(\cdot|\beta)]\right) . (9) 

This equation illustrates the utility of the KL divergence and 
explains the origin of its lack of symmetry. It is clear from 
Eqs. (8) and (9) that the probability of assigning this average 
author to the wrong bin will ultimately vanish exponentially 
with N. Given enough papers, the largest bin will ultimately 
dominate. 

To obtain a quantitative sense of how many papers are 
required in practice, we pose the following question: What is 
the probability that a typical author from each author decile 
with N = 50 published papers will be assigned to the correct 
decile? The answer is plotted as a histogram in Fig. 7 using the 
65th percentile citation rate as a measure. (Similar results are 
obtained when using the mean or median citation rates.) The 
figure indicates that N = 50 papers is more than sufficient to 
identify authors in the first and tenth deciles. In fact, 
approximately 25 and 20 papers respectively are sufficient to place 
authors in these deciles at the 90% confidence level. Fig. 7 also 
indicates that ≈ 50 published papers are sufficient to make 
meaningful assignments of authors to the second, third, and 
ninth deciles. All measures have difficulty in assigning 
authors to deciles 5 through 8. As indicated by the small values of the 
KL divergence in these bins for all measures considered, the 
citation distributions of these authors are simply too similar 
to permit accurate discrimination (see arguments in the 
previous section). On the other hand, the probability that an author 
can be correctly assigned to one of these middle bins on the 
basis of 50 publications is high. This difficulty is due to the 
relatively small range of citations covering these 
bins: the 65th percentile bins 5 through 8 contain authors with 
a 65th percentile between 5 and 13 citations (cf. the narrow 
ranges of the middle bins in the case of the mean, displayed 
in Table II). 
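
The scaling argument of this section can be illustrated numerically. The following is a minimal sketch, assuming a typical author in bin α whose record is n_i = N P(i|α) and a posterior of the form used in Eq. (8), so that the weight of any bin β relative to the true bin is P(β)/P(α) times exp(-N KL[P(·|α),P(·|β)]). The three conditional distributions are invented toy values, not the SPIRES data, and the function name is hypothetical.

```python
import math

def correct_assignment_probability(cond, alpha, n_papers, prior=None):
    """Posterior weight of the true bin alpha for a typical author with N papers.

    cond[b][i] is an assumed P(i|beta); n_papers must be at least 1.
    """
    n_bins = len(cond)
    if prior is None:
        prior = [1.0 / n_bins] * n_bins
    p_true = cond[alpha]
    log_weights = []
    for beta in range(n_bins):
        kl = 0.0
        for p_i, q_i in zip(p_true, cond[beta]):
            if p_i > 0.0:
                # an empty bin in the candidate distribution makes KL infinite
                kl += p_i * math.log(p_i / q_i) if q_i > 0.0 else math.inf
        log_weights.append(math.log(prior[beta]) - n_papers * kl)
    top = max(log_weights)
    weights = [math.exp(lw - top) for lw in log_weights]
    return weights[alpha] / sum(weights)

# Toy conditional distributions over three citation bins (invented numbers).
cond = [
    [0.6, 0.3, 0.1],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
p10, p50, p200 = (correct_assignment_probability(cond, 0, n) for n in (10, 50, 200))
```

In this toy case the neighbouring bin, which has the smallest KL divergence from the true bin, is the last competitor to be suppressed as N grows, mirroring the difficulty of separating adjacent middle deciles.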



VII. CONCLUSIONS 

There are two distinct questions which must be addressed 
in any attempt to use citation data as an indication of author 
quality. The first is whether the measure chosen to character- 
ize a given citation distribution or even the citation distribu- 
tion itself reflects the qualities that we would like to probe. 
The second question is whether a given measure is capable of 
discriminating between authors in a statistically reliable way 
and, by extension, which of several measures is best. We have 
shown that the use of Bayesian statistics and the Kullback- 
Leibler divergence can answer this question in a value-neutral 
and statistically compelling manner. It is possible to draw 
reliable conclusions regarding an author's citation record on 
the basis of approximately 50 papers, and it is possible to as- 
sign meaningful statistical uncertainties to the results. The 
high level of discrimination obtained in the highest and low- 
est deciles provides indirect support for our assumption that 
an author's citation record is drawn at random from an appro- 
priate conditional distribution and suggests that possible addi- 
tional correlations in citation data are not important. Further, 
the difficulty in discriminating between authors in the middle 
deciles suggests that intrinsic author ability is peaked at some 
non-zero value. 

The probabilistic methods adopted here permit meaningful 
comparison of scientists working in distinct areas with only 
minimal value judgments. It seems fair, for example, to de- 
clare equality between scientists in the same percentile of their 
peer groups. It is similarly possible to combine probabilities 
in order to assign a meaningful ranking to authors with publi- 
cations in several disjoint areas. All that is required is knowl- 
edge of the conditional probabilities appropriate for each ho- 
mogeneous subgroup. 

We note, however, that the number of publications required 
to make meaningful author assignments is large enough to 



limit the utility of such analyses in the academic appointment 
process. This raises the question of whether there are more ef- 
ficient measures of an author's full citation record than those 
considered here. Our object has been to find that measure 
which is best able to assign the most similar authors together. 
Straightforward iterative schemes can be constructed to this 
end and are found to converge rapidly (i.e., exponentially) 
to an optimal binning of authors. (The result is optimal in 
the sense that it maximizes the sum of the KL divergences, 
KL[P(·|α),P(·|β)], over all α and β.) The results are only 
marginally better than those obtained here with the mean, me- 
dian or 65th percentile measures. 

Finally, it is also important to recognize that it takes time for 
a paper to accumulate its full complement of citations. While 
there are indications that an author's early and late publications 
are drawn (at random) on the same conditional distribution 
ifTTll . many highly cited papers accumulate citations at a con- 
stant rate for many years after their publication. This effect, 
which has not been addressed in the present analysis, repre- 
sents a serious limitation on the value of citation analyses for 
younger authors. The presence of this effect also poses the ad- 
ditional question of whether there are other kinds of statistical 
publication data that can deal with this problem. Co-author 
linkages may provide a powerful supplement or alternative to 
citation data. (Preliminary studies of the probability that 
authors in bins α and β will co-author a publication reveal a 
striking concentration along the diagonal α = β.) Since each 
paper is created with its full set of co-authors, such informa- 
tion could be useful in evaluating younger authors. This work 
will be reported elsewhere. 



APPENDIX A: VERTICAL STRIPES 

The most striking feature of the calculated P(β|α) shown in 
Fig. 4 is the presence of vertical 'stripes'. These stripes are most 
pronounced for the poorest measures and disappear as the 
reliability of the measure improves. Here, we offer a schematic 
but qualitatively reliable explanation of this phenomenon. To 
this end, imagine that each author's citation record is actually 
drawn at random on the true distributions Q(i|A). For 
simplicity, assume that every author has precisely N publications, 
that each author in true class A has the same distribution of 
citations with n_i^(A) = N Q(i|A), and that there are equal 
numbers of authors in each true author class. These authors are 
then distributed into author bins, α, according to some chosen 
quality measure. The methods of Sections IV and V can 
then be used to determine P(i|α), P({n_i^(A)}|β), P(β|{n_i^(A)}), 
and P(β|α). Given the form of the n_i^(A) and assuming that N 
is large, we find that 






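P(\beta|\{n_i^{(A)}\}) \approx \exp\left(-N\,\mathrm{KL}[Q(\cdot|A), P(\cdot|\beta)]\right) (A1) 

and 

P(\beta|\alpha) \sim \sum_A P(A|\alpha) \exp\left(-N\,\mathrm{KL}[Q(\cdot|A), P(\cdot|\beta)]\right) , (A2) 

where P(A|α) is the probability that the citation record of an 
author assigned to class α was actually drawn on Q(i|A). The 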






FIG. 8: A comparison of the approximate P(β|α) from Eq. (A2) and 
the exact P(β|α) for the papers published per year measure. 



results of this approximate evaluation are shown in Fig. 8 and 
compared with the exact values of P(β|α) for the papers per 
year measure. The approximations do not affect the 
qualitative features of interest. 

We now assume that the measure defining the author bins, 
α, provides a poor approximation to the true bins, A. In this 
case, authors will be roughly uniformly distributed, and the 
factor P(A|α) appearing in Eq. (A2) will not show large 
variations. Significant structure will arise from the exponential 
terms, where the presence of the factor N (assumed to be 
large) will amplify the differences in the KL divergences. The 
KL divergence will have a minimum value for some value of 
A = A_0(β), and this single term will dominate the sum. Thus, 
P(β|α) reduces to 

P(\beta|\alpha) \sim P(A_0|\alpha) \exp\left(-N\,\mathrm{KL}[Q(\cdot|A_0), P(\cdot|\beta)]\right) . (A3) 

The vertical stripes prominent in Figs. 4(a) and (b) emerge as 
a consequence of the dominant β-dependent exponential factor. 
The present arguments also apply to the worst possible 
measure, i.e., a completely random assignment of authors to 
the bins α. In the limit of a large number of authors, N_aut, 
all P(i|β) will be equal except for statistical fluctuations. The 
resulting KL divergences will respond linearly to these 
fluctuations (see footnote 12). These fluctuations will be amplified as before 
provided only that N_aut grows less rapidly than N^2. The argument 
here does not apply to good measures where there is significant 
structure in the term P(A|α). (For a perfect measure, 
P(A|α) = δ_{Aα}.) In the case of good measures, the expected 
dominance of diagonal terms (seen in the lower row of Fig. 4) 
remains unchallenged. 
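
The origin of the stripes can be reproduced in a toy calculation. The sketch below evaluates the form of Eq. (A2) for a deliberately poor measure, for which P(A|α) is taken to be the same for every α. All distributions are invented for illustration, and the helper names `kl` and `stripe_matrix` are hypothetical.

```python
import math

def kl(p, q):
    """KL[p, q] with the natural logarithm; q is assumed to have no empty bins."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0.0)

def stripe_matrix(p_true_given_bin, q_true, p_bin, n_papers):
    """Rows are alpha, columns are beta; entries follow Eq. (A2), row-normalized."""
    rows = []
    for p_a in p_true_given_bin:          # P(A|alpha) for this alpha
        row = []
        for p_beta in p_bin:              # P(.|beta) for this beta
            row.append(sum(w * math.exp(-n_papers * kl(q, p_beta))
                           for w, q in zip(p_a, q_true)))
        total = sum(row)
        rows.append([x / total for x in row])
    return rows

q_true = [[0.7, 0.3], [0.3, 0.7]]                   # Q(.|A) for two true classes
p_bin = [[0.52, 0.48], [0.50, 0.50], [0.45, 0.55]]  # nearly equal P(.|beta)
poor = [[0.5, 0.5]] * 3                             # P(A|alpha) carries no information

matrix = stripe_matrix(poor, q_true, p_bin, n_papers=50)
for row in matrix:
    print(" ".join(f"{x:.3f}" for x in row))
```

Every row of the printed matrix is identical while the columns differ markedly, so a plot of the areas would show vertical stripes: the small differences among the P(·|β) are amplified by the factor N in the exponent.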



APPENDIX B: EXPLICIT DISTRIBUTIONS 

For convenience we present all data needed to determine the 
probabilities P(α|{n_i}) for authors who publish in the theory 
subsection of SPIRES. Data is presented only for the case of the mean 



12 This is true because there will be no choice of A such that Q(i|A) = P(i|α). 
         P(i|α)                          P(N|α) 
Bin number   Citation range     Bin number   Total paper range 
i = 1        k = 1              m = 1        N = 25 
i = 2        k = 2              m = 2        N = 26 
i = 3        2 < k ≤ 4          m = 3        26 < N ≤ 28 
i = 4        4 < k ≤ 8          m = 4        28 < N ≤ 32 
i = 5        8 < k ≤ 16         m = 5        32 < N ≤ 40 
i = 6        16 < k ≤ 32        m = 6        40 < N ≤ 56 
i = 7        32 < k ≤ 64        m = 7        56 < N ≤ 88 
i = 8        64 < k ≤ 128       m = 8        88 < N ≤ 152 
i = 9        128 < k ≤ 256      m = 9        152 < N ≤ N_max 
i = 10       256 < k ≤ 512 
i = 11       512 < k ≤ N_max 

TABLE I: The binning of citations and total number of papers. The 
first and second columns show the bin number and bin ranges for the 
citation bins used to determine the conditional citation probabilities 
P(i|α) for each α, shown in Table III. The third and fourth columns 
display the bin number and total number of paper ranges used in the 
creation of the conditional probabilities P(m|α) for each α, displayed 
in Table IV. 



number of citations. All citations are binned logarithmically 
according to the citation bins listed in columns one and two 
of Table I. The author bins are determined on the basis of 
deciles of the total distribution of mean citations, p(⟨k⟩). 
Table II shows the relevant quantities for these bins. Given the 
definitions of both the author and citation bins, we can 
determine the conditional citation distributions P(i|α) empirically. 
These are given in Table III. 



α     ⟨k⟩-range         # authors   p(α)   n(α) 
1           - 1.69      673         0.1    37.0 
2      1.69 - 3.08      673         0.1    41.8 
3      3.08 - 4.88      675         0.1    44.0 
4      4.88 - 6.94      673         0.1    46.8 
5      6.94 - 9.40      674         0.1    52.2 
6      9.40 - 12.56     674         0.1    54.3 
7     12.56 - 16.63     673         0.1    59.5 
8     16.63 - 22.19     674         0.1    59.0 
9     22.19 - 33.99     674         0.1    65.4 
10    33.99 - 285.88    674         0.1    72.2 



TABLE II: The author bins. This table shows the mean numbers of 
citations that define the limits of the 10 author bins. 
13 This fact is known as Lotka's Law [22]. 



We also need the probabilities P(N|α) that an 
author in bin α has N publications. Because of the low number 
of authors in each bin, we need to bin the total number of 
publications when calculating this probability; we use the letter 
m to enumerate the N-bins. Because P(N|α) is described 
by a power-law distribution (see footnote 13) and since we only consider 
authors with more than 25 publications, we choose to bin N 
logarithmically as displayed in the third and fourth columns of 
Table I. The conditional probabilities, P(m|α), are displayed 
in Table IV. 
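
The logarithmic binning of Table I can be written compactly. The sketch below assumes that the upper limit of each range is inclusive and that the last bin in each column is open-ended; the function names are hypothetical, and inputs below the first listed bin (k < 1 or N < 25) are rejected because the table lists no bin for them.

```python
def citation_bin(k):
    """Citation bin i (1 to 11) for a paper with k citations, following Table I."""
    if k < 1:
        raise ValueError("Table I lists no citation bin for k < 1")
    # (k - 1).bit_length() equals ceil(log2(k)) for integer k >= 1
    return min((k - 1).bit_length() + 1, 11)

def paper_count_bin(n):
    """Publication bin m (1 to 9) for an author with n papers, following Table I."""
    if n < 25:
        raise ValueError("Table I lists no publication bin for N < 25")
    # bin edges sit at 25 + 2**j, so the same doubling rule applies to n - 24
    return min((n - 25).bit_length() + 1, 9)

# Counting an author's papers per citation bin gives the record {n_i}.
citations = [1, 3, 4, 9, 40, 700]
record = [0] * 11
for k in citations:
    record[citation_bin(k) - 1] += 1
```

Using integer bit lengths rather than floating-point logarithms avoids rounding problems exactly at the bin edges (k = 4, 8, 16, and so on).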
TABLE III: The distributions P(i|α). This table displays the conditional probabilities that an author writes a paper in paper-bin i given that his 
author-bin is α. 
TABLE IV: The conditional probabilities P(m|α). This table contains the conditional probabilities that an author has a total number of 
publications in publication-bin m given that his author-bin is α. 